The general form of the simple linear model is:
\[y_i = \beta_0 + \beta_1 x_{i,1} + \epsilon_i, \ i = 1,\ldots,n, \quad \mathrm{such \ that } \ \epsilon_i \sim \mathrm{Normal}(0,\sigma^2),\]
which is used to describe the relationship between a response \(y\) and an explanatory variable or predictor \(x\).
Data
Student evaluations for a sample of 463 courses taught by 94 professors from the University of Texas at Austin, including information on each professor's evaluation score and age.
Here, we will fit a simple linear regression model where the explanatory variable is categorical, i.e. a factor with two or more levels.
\[ y_{i} = \alpha + \beta_{\text{male}} \cdot \mathbb{I}_{\text{male}}(x) + \epsilon_i \]
\[\mathbb{I}_{\text{male}}(x)=\left\{ \begin{array}{ll} 1 ~~~ \text{if gender} ~ x ~ \text{is male},\\ 0 ~~~ \text{otherwise}.\\ \end{array} \right.\]
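In code, the indicator function is just a dummy variable taking the values 0 and 1. A minimal sketch (the coefficients `alpha` and `beta_male` here are made-up numbers for illustration, not estimates from the evals data):

```python
# Indicator (dummy) coding for a binary categorical predictor.
def indicator_male(gender):
    """I_male(x): 1 if gender x is 'male', 0 otherwise."""
    return 1 if gender == "male" else 0

# Hypothetical coefficients, purely for illustration.
alpha, beta_male = 4.0, 0.2

for g in ["female", "male"]:
    y_hat = alpha + beta_male * indicator_male(g)
    print(g, y_hat)
```

The fitted mean is `alpha` for females and `alpha + beta_male` for males, which is exactly what the piecewise definition above says.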
The general form of the (multiple) linear model is:
\[y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \ldots + \beta_p x_{i,p} + \epsilon_i, \ i = 1,\ldots,n,\]
where:
\(y_i\) is the response of the \(i^{th}\) observation;
\(\beta_0\) is the intercept;
\(\beta_1\) is the coefficient for the first explanatory variable \(x_1\);
\(\beta_2\) is the coefficient for the second explanatory variable \(x_2\);
\(\beta_p\) is the coefficient for the \(p^{th}\) explanatory variable \(x_p\);
\(\epsilon_i\) is the \(i^{th}\) random error component.
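To make the notation concrete, here is a minimal sketch of fitting such a model by ordinary least squares on simulated data. The "true" coefficients 2.0, 0.5 and -1.5 are invented for the simulation; the notes themselves fit these models in R with `lm()`.

```python
import numpy as np

# Least-squares fit of y = b0 + b1*x1 + b2*x2 + eps on simulated data.
rng = np.random.default_rng(42)
n = 1000
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2.0 + 0.5 * x1 - 1.5 * x2 + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), x1, x2])          # design matrix with intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # (b0_hat, b1_hat, b2_hat)
print(np.round(beta_hat, 2))
```

With this much data and small noise, the estimates land close to the simulated coefficients, which is the behaviour the model equation promises when its assumptions hold.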
Last week we covered the case with two numerical explanatory variables. This week, we will have a look at the case with one numerical and one categorical explanatory variable.
We already explored the relationship between teaching score (score) and age (age). Now, let’s also introduce the additional (binary) categorical explanatory variable gender (gender).
\(\alpha\) is the intercept of the regression line for females;
\(\beta_{\text{age}}\) is the slope of the regression line for both males and females;
\(\beta_{\text{male}}\) is the additional term added to \(\alpha\) to get the intercept of the regression line for males; and
\(\mathbb{I}_{\text{male}}(x)\) is an indicator function that takes the value 1 if the person is male and 0 otherwise.
library(dplyr)      # %>% and select()
library(moderndive) # evals data
library(sjPlot)     # tab_model()

evals_multiple <- evals %>%
  select(score, gender, age)

par.model <- lm(score ~ age + gender, data = evals_multiple)
tab_model(par.model)

| score | | | |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 4.48 | 4.24 – 4.73 | <0.001 |
| age | -0.01 | -0.01 – -0.00 | 0.001 |
| gender [male] | 0.19 | 0.09 – 0.29 | <0.001 |
| Observations | 463 | | |
| R2 / R2 adjusted | 0.039 / 0.035 | | |
Hence, the regression line for females is given by:
\[\widehat{\text{score}} = 4.48 - 0.009 \cdot \text{age},\]
while the regression line for males is given by:
\[\widehat{\text{score}} = 4.48 - 0.009 \cdot \text{age} + 0.191 = 4.671 - 0.009 \cdot \text{age}.\]
Under the parallel regression lines model, the associated effect of age on teaching score is the same for both men and women.
For every one year increase in age, there is an associated decrease in teaching score of 0.009.
Male instructors have a higher intercept term. This is linked to the average difference in teaching scores that males obtain relative to females.
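The arithmetic behind these two parallel lines can be checked directly; the sketch below simply re-uses the rounded estimates (4.48, -0.009 and 0.191) from the regression table above.

```python
# Fitted parallel regression lines, using the rounded coefficient
# estimates from the regression table above.
intercept, b_age, b_male = 4.48, -0.009, 0.191

def score_hat(age, male):
    """Fitted teaching score; male is 1 for male instructors, 0 otherwise."""
    return intercept + b_age * age + b_male * male

# Same slope for both genders; the male line sits 0.191 higher at every age.
print(round(score_hat(40, 0), 3), round(score_hat(40, 1), 3))
```

Whatever the age, the vertical gap between the two fitted lines is the constant 0.191: that is what "parallel lines" means here.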
In this model, the effect of age on teaching scores can differ by gender:
\[y_{i} = \alpha + \beta_{\text{age}} \cdot \text{age} + \beta_{\text{male}} \cdot \mathbb{I}_{\text{male}}(x) + \beta_{\text{age, male}} \cdot \text{age} \cdot \mathbb{I}_{\text{male}}(x) + \epsilon_i,\]
where \(\beta_{\text{age, male}} \cdot \text{age} \cdot \mathbb{I}_{\text{male}}(x)\) corresponds to the interaction term.
More concretely:
\[y_{i}=\left\{ \begin{array}{ll} \alpha + \beta_{\text{age}} \cdot \text{age} + \epsilon_i ~~~ \text{if gender} ~ x ~ \text{is female},\\ (\alpha + \beta_{\text{male}}) + (\beta_{\text{age}} + \beta_{\text{age, male}}) \cdot \text{age} + \epsilon_i ~~~ \text{otherwise}.\\ \end{array} \right.\]
Fitting the interaction model gives the following estimates:

# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.88 0.205 23.8 4.02e-82
2 age -0.0175 0.00447 -3.92 1.03e- 4
3 gendermale -0.446 0.265 -1.68 9.35e- 2
4 age:gendermale 0.0135 0.00553 2.45 1.48e- 2
The regression line for females is given by:
\[ \widehat{\text{score}} = 4.88 - 0.018 \cdot \text{age}, \]
while the regression line for males is given by:
\[\widehat{\text{score}} = 4.88 - 0.018 \cdot \text{age} - 0.446 + 0.014 \cdot \text{age} = 4.434 - 0.004 \cdot \text{age}.\]
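The male line's intercept and slope follow directly from the rounded estimates in the model output above:

```python
# Interaction model: rounded estimates from the model output above.
intercept, b_age = 4.88, -0.018
b_male, b_age_male = -0.446, 0.014

male_intercept = intercept + b_male   # 4.88 - 0.446 = 4.434
male_slope = b_age + b_age_male       # -0.018 + 0.014 = -0.004
print(round(male_intercept, 3), round(male_slope, 3))
```

Unlike in the parallel lines model, the two fitted lines now have different slopes, so the gap between male and female predicted scores changes with age.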
Now we have to assess the fit of the model by looking at plots of the residuals.
A confidence interval gives a range of plausible values for a population parameter.
It depends on a specified confidence level.
Instead of estimating an unknown parameter using a single point estimate/sample statistic, we can use a range of possible values centred around our sample statistic to make a plausible guess as to the location of the parameter.
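As a sketch of how such an interval is computed, take the gender coefficient from the parallel lines model table (estimate 0.19, standard error 0.05). The 1.96 multiplier below is the normal approximation for a 95% level; R's own intervals use the appropriate t quantile, which is very close for 463 observations.

```python
# Approximate 95% confidence interval: estimate +/- 1.96 * SE.
# Estimate and SE are the gender coefficient from the parallel
# lines model table in this section.
estimate, se = 0.19, 0.05
lower, upper = estimate - 1.96 * se, estimate + 1.96 * se
print(round(lower, 2), round(upper, 2))
```

The result reproduces the reported interval 0.09 – 0.29 to two decimal places.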
| | parallel lines model | | | | interaction model | | | |
|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p | Estimates | std. Error | Statistic | p |
| (Intercept) | 4.48 (4.24 – 4.73) | 0.13 | 35.79 | <0.001 | 4.88 (4.48 – 5.29) | 0.21 | 23.80 | <0.001 |
| age | -0.01 (-0.01 – -0.00) | 0.00 | -3.28 | 0.001 | -0.02 (-0.03 – -0.01) | 0.00 | -3.92 | <0.001 |
| gender [male] | 0.19 (0.09 – 0.29) | 0.05 | 3.63 | <0.001 | -0.45 (-0.97 – 0.08) | 0.27 | -1.68 | 0.094 |
| age × gender [male] | | | | | 0.01 (0.00 – 0.02) | 0.01 | 2.45 | 0.015 |
| Observations | 463 | | | | 463 | | | |
| R2 / R2 adjusted | 0.039 / 0.035 | | | | 0.051 / 0.045 | | | |
The table includes:
A confidence interval gives a range of plausible values for a population parameter.
When there is more than one explanatory variable in a model, the parameter associated with each explanatory variable is interpreted as the change in the mean response based on a 1-unit change in the corresponding explanatory variable keeping all other variables held constant.
The confidence interval for each parameter should be interpreted as a range of plausible values for that parameter conditional on all the other explanatory variables in the model.
We will introduce some of the ideas in the simple case where we have 2 potential explanatory variables (\(x_1\) and \(x_2\)) and use confidence intervals to decide which variables will be useful in predicting the response variable (\(y\)).
One approach is to consider a hierarchy of models:
\[y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i}\]
\[y_i = \alpha + \beta_1 x_{1i} \qquad \qquad y_i = \alpha + \beta_2 x_{2i}\]
\[y_i = \alpha\]
Within this structure we might take a top-down approach: fit the full model with both \(x_1\) and \(x_2\) first, and if the confidence interval for one of the coefficients contains zero, drop that variable and move down to the corresponding single-variable model, continuing until all remaining coefficients are plausibly non-zero.
Recall that as well as age and gender, there is also a potential explanatory variable bty_avg in the evals data, i.e. the numerical variable giving the average beauty score of each professor as rated by a panel of six students on a scale of 1 to 10.
| score | | | |
|---|---|---|---|
| Predictors | Estimates | std. Error | p |
| (Intercept) | 4.05 (3.72 – 4.39) | 0.17 | <0.001 |
| age | -0.00 (-0.01 – 0.00) | 0.00 | 0.251 |
| bty avg | 0.06 (0.03 – 0.09) | 0.02 | <0.001 |
| R2 / R2 adjusted | 0.038 / 0.034 | | |
Question
Which variable should we drop?
When the number of potential explanatory variables is large the problem of selecting which variables to include in the final model becomes more difficult.
When we do model selection we compromise between two competing objectives: achieving a good fit to the data, and keeping the model simple (few parameters).
There are many objective criteria for comparing different models applied to the same data set.
All of them trade off the two objectives above, i.e. fit to the data against complexity. Common examples include:
The \(R^2_{adj}\) values, i.e. the proportion of the total variation of the response variable explained by the model, adjusted for the number of parameters:
\[R_{adj}^2 = 1 - \frac{RSS/(n-p-1)}{SST/(n-1)} = 1-\frac{\sum_{i=1}^n(y_i-\widehat{y}_i)^2/(n-p-1)}{\sum_{i=1}^n(y_i-\bar{y})^2/(n-1)}\]
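A numerical sketch of this formula, with made-up RSS and SST values (chosen only for illustration):

```python
# Adjusted R^2 from RSS, SST, n and p (illustrative made-up values).
n, p = 463, 2
rss, sst = 128.0, 133.0

r2 = 1 - rss / sst
r2_adj = 1 - (rss / (n - p - 1)) / (sst / (n - 1))
print(round(r2, 3), round(r2_adj, 3))
```

Because the numerator is divided by \(n-p-1\) rather than \(n-1\), adding parameters without improving the fit pushes \(R^2_{adj}\) down, so it is always at most \(R^2\).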
Akaike’s Information Criterion (AIC):
\[AIC = -2 \cdot \text{log-likelihood} + 2p = n\ln\left(\frac{RSS}{n}\right)+2p \quad (\text{up to an additive constant}),\]
where \(p\) is the number of estimated parameters. Stepwise selection based on AIC can be performed using the stepAIC function from the MASS library.

Bayesian Information Criterion (BIC):
\[BIC = -2 \cdot \text{log-likelihood} + \ln(n) \cdot p\]
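Both criteria are simple functions of the log-likelihood. A sketch (here `p` counts all estimated parameters including the residual variance, matching the convention of R's `logLik()`/`AIC()`; the value -366 is the rounded log-likelihood reported by `glance()` for one of the models in this section):

```python
import math

def aic(loglik, p):
    """Akaike's information criterion."""
    return -2 * loglik + 2 * p

def bic(loglik, p, n):
    """Bayesian information criterion: heavier penalty than AIC once ln(n) > 2."""
    return -2 * loglik + math.log(n) * p

# Illustration with rounded values: logLik = -366, p = 3, n = 463.
print(aic(-366, 3), round(bic(-366, 3, 463), 1))
```

With n = 463, \(\ln(n) \approx 6.1\), so BIC penalises each extra parameter about three times as heavily as AIC, which is why BIC tends to favour smaller models.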
A popular data analysis strategy that can be adopted is to calculate \(R_{adj}^2\), \(AIC\) and \(BIC\) and compare the models which minimise the \(AIC\) and \(BIC\) with the model that maximises the \(R_{adj}^2\).
To illustrate this, let’s return to the evals data and the MLR on the teaching evaluation score score with the two continuous explanatory variables age and bty_avg and compare this with the SLR model with just bty_avg.
For the SLR model with age as the only explanatory variable:

# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0115 0.00931 0.541 5.34 0.0213 1 -372. 750. 762.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
For the SLR model with bty_avg as the only explanatory variable:

# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0350 0.0329 0.535 16.7 0.0000508 1 -366. 738. 751.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
For the MLR model with both age and bty_avg:

# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0378 0.0336 0.535 9.03 0.000142 2 -366. 739. 756.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
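Putting the three sets of criteria side by side makes the comparison mechanical. The values below are copied, rounded, from the `glance()` outputs above; the mapping from each output to a model formula is inferred from the R² values and is an assumption of this sketch.

```python
# Criteria for the three candidate models (rounded values copied from
# the glance() outputs above; model labels inferred from context).
models = {
    "score ~ age":           {"adj_r2": 0.0093, "aic": 750, "bic": 762},
    "score ~ bty_avg":       {"adj_r2": 0.0329, "aic": 738, "bic": 751},
    "score ~ age + bty_avg": {"adj_r2": 0.0336, "aic": 739, "bic": 756},
}

best_adj_r2 = max(models, key=lambda m: models[m]["adj_r2"])  # maximise
best_aic = min(models, key=lambda m: models[m]["aic"])        # minimise
best_bic = min(models, key=lambda m: models[m]["bic"])        # minimise
print(best_adj_r2, best_aic, best_bic)
```

Here the criteria disagree slightly: \(R^2_{adj}\) marginally prefers the two-variable model, while AIC and BIC both prefer the simpler model with bty_avg alone, illustrating why judgement is still needed.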
A great deal of care should be taken in selecting explanatory variables for a model because the values of the regression coefficients depend upon the variables included in the model.
One thing NOT to do is select hundreds of random predictors, bung them all into a regression analysis and hope for the best.
There are automatic strategies, such as Stepwise and Best Subsets regression, based on systematically searching through the entire list of variables not in the current model to make decisions on whether each should be included.
These strategies need to be handled with care. Our best strategy is a mixture of judgement about which variables should be included as potential explanatory variables, together with an interval estimation and hypothesis testing strategy for assessing them. The judgement should be made in the light of advice from the problem context.
Important
Golden rule for modelling
The key to modelling data is to only use the objective measures as a rough guide. In the end the choice of model will involve your own judgement. You have to be able to defend why you chose a particular model.